DOCUMENT RESUME


ED 139 218                                EC 100 789

AUTHOR Halpern, Andrew S. 

TITLE Principles and Practices of Measurement in Career
Education for the Handicapped.

PUB DATE Apr 77

NOTE 34p.; Paper presented at the Annual International
Convention, The Council for Exceptional Children
(55th, Atlanta, Georgia, April 11-15, 1977)

EDRS PRICE MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS *Career Education; *Evaluation Criteria;
*Handicapped; Legislation; *Measurement Techniques;
Norm Referenced Tests; *Performance Criteria;
Performance Tests; *Testing


ABSTRACT
Discussed is testing in the field of career education
for the handicapped, with emphasis on four major topics: applied
performance testing, criterion validity studies, product vs. process
measurement, and criterion vs. norm-referenced measurement. The
author reviews some political considerations relevant to this area of
testing. (IM)


***********************************************************************
* Documents acquired by ERIC include many informal unpublished        *
* materials not available from other sources. ERIC makes every effort *
* to obtain the best copy available. Nevertheless, items of marginal  *
* reproducibility are often encountered and this affects the quality  *
* of the microfiche and hardcopy reproductions ERIC makes available   *
* via the ERIC Document Reproduction Service (EDRS). EDRS is not      *
* responsible for the quality of the original document. Reproductions *
* supplied by EDRS are the best that can be made from the original.   *
***********************************************************************


U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.


COLLEGE OF EDUCATION
CENTER ON HUMAN DEVELOPMENT
REHABILITATION RESEARCH AND TRAINING CENTER
IN MENTAL RETARDATION
UNIVERSITY OF OREGON, EUGENE

PRINCIPLES AND PRACTICES
OF MEASUREMENT IN
CAREER EDUCATION FOR
THE HANDICAPPED

Andrew S. Halpern

April 15, 1977

Introduction


There are many measurement methods, both formal
and informal, that have evolved within the field of
career education for the handicapped. These methods
include the interpretation of historical records and
interviews, third-party evaluations, direct observations,
and direct testing of behavior.

This presentation will focus entirely upon testing,
which is the most ubiquitous form of measurement in the
field. Four major topics will be discussed:

- applied performance testing;
- criterion validity studies;
- product vs. process measurement;
- criterion vs. norm-referenced measurement.

The presentation will conclude with some recent political
considerations that are relevant to the field.


Applied Performance Testing

In recent years, there has been a strong plea from
several sources to focus our test development efforts
upon "criterion sampling" (McClelland, 1973; Lavisky,
1975; Kulman, et al., 1975). If one is interested in
predicting whether or not a person will succeed as a
welder, why bother measuring the speed and accuracy with
which he places pegs in holes?


One of the more adamant sources of this appeal can
be found in the literature on applied performance testing
(Lavisky, 1975). At its most basic definitional level,
applied performance testing is accomplished when a
desired behavior is measured in the context in which
performance is expected. One step removed from
contextual relevance would be measurement of the
desired behavior in a simulated context. A second step
removed would be measurement of knowledge about the
desired behavior.

In some instances, cognitive behavior or knowledge
is an end in itself. When this occurs, measurement of
knowledge can itself be construed as applied performance.

Some of the work samples that have been developed
for vocational evaluation of the handicapped are good
examples of applied performance testing. The Vocational
Evaluation and Work Adjustment Association offers the
following definition of a work sample (Kulman, et al.,
1975):

    a well defined work activity involving tasks,
    materials, and tools which are identical or
    similar to those in an actual job or cluster
    of jobs . . . The work sample should
    simulate the complete range of work activities
    of which a particular job or occupational group
    is comprised. (p. 55)


Not all work samples are as clearly related to
criterion behaviors as this definition would suggest.
Some involve criterion sampling only in a rather
abstract sense, since the tasks are performed in a
context far removed from an actual vocational setting.

The appeal of applied performance testing lies not
only in the common sense desirability of measuring what
really is of interest, but also in the assumption that
measurement of criterion behavior has much clearer
relevance for instructional intervention than
measurement of some surrogate for the criterion.


This philosophy of diagnostic/prescriptive measurement
and intervention has almost become a religion in
today's educational community, particularly among
educators of the handicapped. There is much to be said
for this position. Before totally joining the bandwagon,
however, it is worth mentioning some of the precautions
that were recently raised by a proponent of this position
(Joselyn, 1975):
- Large amounts of time and energy are often
required for the production and measurement
of behavioral objectives.

- No single list of objectives is likely to serve
the needs of many students and teachers.

- Easy-to-measure objectives may get chosen only
because of this technical advantage, resulting
in a neglect of more relevant but harder to
measure objectives.

- If accountability is a major consideration,
teachers may choose some objectives only
because they are easily achievable.

- Focusing on precise educational objectives tends
to detract from consideration of broad educational
goals, which often leads to an assumption that
existing goals are adequate.

If one examines this list of precautions, one major factor
emerges as the cause of many of the problems. Life is
simply too complex to specify all the precise behavioral
criteria that are desired outcomes of our interventions.
Measurement will almost always require a sampling of behavior,
which raises the dual questions of the generalizability and
the pertinence of our findings to related behaviors and
to other contexts.

Criterion Validity Studies

If we depart from criterion measurement, or
measure only a sample of criterion behavior, we are
forced to raise the question of criterion validity.
Most tests are subject to this requirement. Of the
many that exist in the field of career education,
I have chosen three that seem to be somewhat
representative. One examines the relationship between
general aptitudes and vocational criteria in a
non-handicapped population. Another examines the
relationship between work samples and vocational
criteria in a handicapped population. The third
examines the relationship between knowledge and
prevocational criteria in a mentally retarded population.


General Aptitude Test Battery. The General Aptitude
Test Battery, or GATB, is probably the most highly
researched vocational evaluation instrument in
existence. It is used primarily for the purpose of
helping to place individuals in vocational
training programs or jobs.

The battery consists of 12 tests yielding nine
aptitude scores. The scores and their corresponding
aptitudes are shown in Figure 1.
The criterion validity of the GATB for different
jobs or occupations is determined through the development
of "validity norms." This process entails a number
of well defined steps, beginning with the selection of
an "experimental sample" consisting of people in a given
occupation or preparing for a given occupation. The
GATB is then administered to this sample, in order to
identify those aptitudes with high means and low
standard deviations. Criterion measures are next
obtained, usually involving either production records
or performance evaluations by supervisors. The final
step in the analysis involves dichotomizing the
relevant aptitudes, and then experimenting with
multiple cutoffs within each aptitude in order to
discover which arrangement maximizes the number of
people who are either high scorers on both the aptitudes
and the criterion or low scorers on both the aptitudes
and the criterion. An example from the GATB Manual
(Manual for the . . ., 1970) illustrates this process
for the job of case worker.

Case Worker
G - Intelligence
V - Verbal Intelligence
N - Numerical Aptitude
Q - Clerical Perception

The experimental sample for this study consisted
of 106 case workers. After identifying G, V, N, and
Q as potentially relevant aptitudes, the multiple
cutoff analyses suggested that only G, V, and N
be retained. The relationship between test scores
and criterion performance is shown in Table 1.


Relationship Between Test Norms and
Dichotomized Criterion - Case Worker (N = 106)

As this table shows, the obtained phi coefficient
was indeed statistically significant. But what can we
say of its practical significance? Twenty-one percent
of those who earned qualifying test scores turned out
to be poor workers. More potently, over 50% of those
who did not earn qualifying test scores turned out to
be good workers. How confident could we be, on the
basis of this study, in using GATB scores to advise a
person as to whether or not to become a case worker?
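
The gap between statistical and practical significance can be illustrated with a short Python sketch. The cell counts below are hypothetical: the body of Table 1 is not reproduced here, so the numbers were chosen only to total 106 and to resemble the two error rates quoted in the text. The function names are likewise illustrative and are not taken from the GATB Manual.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient for a 2x2 table laid out as:

                        good worker   poor worker
        qualifying           a             b
        non-qualifying       c             d
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

def error_rates(a, b, c, d):
    """Return the two kinds of misclassification as proportions."""
    qualified_but_poor = b / (a + b)       # passed the cutoffs, poor worker
    unqualified_but_good = c / (c + d)     # failed the cutoffs, good worker
    return qualified_but_poor, unqualified_but_good

# Hypothetical counts (assumption, not the values from Table 1).
a, b, c, d = 52, 14, 21, 19

phi = phi_coefficient(a, b, c, d)
qualified_but_poor, unqualified_but_good = error_rates(a, b, c, d)
```

With these assumed counts the phi coefficient comes out near .28, yet roughly one qualifier in five is a poor worker and more than half of the non-qualifiers are good workers, which is the pattern the text describes.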

TOWER System. Our second example of a criterion
validity study comes from the literature on work sample
evaluation with handicapped people. The TOWER System,
developed at the Institute for the Crippled and
Disabled in New York City, is the oldest and best
known work sample evaluation system that has been
developed for use with handicapped people (Testing,
Orientation and Work Evaluation in Rehabilitation, 1974).
It consists of 94 work samples that have been clustered
into 14 occupational areas (ICD Rehabilitation and
Research Center, 1974).
In 1967, a report was published describing a fairly
comprehensive attempt to assess the criterion validity of
the TOWER System (Rosenberg, 1967). In this study,
relationships were examined between work sample
evaluation scores, the "intermediate" criterion of
supervisor ratings of performance during training, and
the "ultimate" criterion of acquisition and retention
of a job.


The primary predictor variables of interest were
the "performance rating" and the "patty rating" that
are derived from each work sample evaluation. Both
ratings used a five-point scale, summarizing the
evaluator's prediction on the likelihood of successful
training and subsequent employment in occupations
represented by the work sample.

When work sample evaluation was completed on the
experimental sample, four types of disposition
occurred:

- the client received vocational training in a
trade area;
- the client received training in unskilled
workshop activities;
- the client received direct placement at a job;
- the client was closed as not feasible.

Subsequent analyses involved only members of the first
three of these groups.


The project was completely unsuccessful in
demonstrating criterion validity for the TOWER System.
For the most part, very low relationships were found
between the predictor and the criterion variables.
The author of the project report offered several
explanations for these findings:


- Not being able to randomly assign subjects
within each disposition greatly restricted
the range of predictor scores within each
disposition. The restricted range, in turn,
set statistical limits on the possible
magnitude of the correlation coefficients.

- Inter-rater reliabilities were not established
for either the predictor or the criterion
variables. Both sides of the equation were
believed to be somewhat deficient.

- The administration procedures in the TOWER
System were not fully standardized, requiring
many clinical judgments on potentially different
behavioral samples.

- The quality of training provided to clients
varied in an unknown manner.

- The follow-up procedures were inadequate
and produced limited data.
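
The first explanation, restriction of range, can be demonstrated with a small simulation. The Python sketch below is an illustration only: the population correlation of .60, the sample size, the seed, and the band of retained predictor scores are all assumptions, and none of them comes from the TOWER report.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(n=5000, true_r=0.6, seed=1977):
    """Draw predictor/criterion pairs whose population correlation is true_r."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        xs.append(x)
        ys.append(true_r * x + math.sqrt(1 - true_r ** 2) * noise)
    return xs, ys

xs, ys = simulate()
r_full = pearson_r(xs, ys)

# Keep only cases whose predictor score falls in a narrow band, as can
# happen when clients are sorted into dispositions on the basis of ratings.
band = [(x, y) for x, y in zip(xs, ys) if 0.0 <= x <= 1.0]
r_restricted = pearson_r([x for x, _ in band], [y for _, y in band])
```

Under these assumptions the correlation computed within the narrow band is markedly lower than the correlation in the full sample, even though the underlying relationship is the same in both.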


Social and Prevocational Information Battery.
The third example of a criterion validity study is derived
from our own work at the Rehabilitation Research and
Training Center in Mental Retardation at the University
of Oregon. This series of studies examines the relationship
between knowledge and criterion behavior in several
social and prevocational domains.


During the past five years, our Center has been
involved in the articulation of a measurement strategy
and the creation of a standardized series of tests
known as the Social and Prevocational Information Battery
(SPIB). The SPIB was originally developed for use with
mildly retarded people (Halpern, Raffeld, Irvin, & Link,
1975a; Irvin & Halpern, In press), and has been available
in published form since September, 1975 (Halpern,
Raffeld, Irvin, & Link, 1975b). A major revision of
the SPIB for use with moderately retarded people has
recently been completed and is available now (Irvin,
Halpern, & Reynolds, 1977; Irvin, Halpern, & Reynolds,
In press).
The domains measured by the SPIB relate to five
broad areas of adult community adjustment: employability,
economic self-sufficiency, family living, personal
habits, and communication. Within these five broad
areas, nine separate tests have been developed,
measuring the examinee's knowledge of job search skills,
job related behavior, banking, budgeting, purchasing,
home management, health care, hygiene and grooming,
and functional signs (sight reading).


Field testing produced 277 items, either
true/false or picture selection in format, which are
orally administered to examinees in order to neutralize
the impact of differential reading ability. Each
of the nine tests contains approximately 30 items that
can be administered separately, requiring 10 to 15
minutes for completion.


Items were selected in accordance with a domain
sampling model, whereby each domain or area to be
tested was further specified hierarchically into 54
content areas and 180 sub-content areas across the nine
SPIB domains. The sub-content areas then served as a
blueprint for generating test items for possible
inclusion in the battery.
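
A hierarchical blueprint of this kind can be pictured as a nested data structure. The Python sketch below is a hypothetical fragment: the two domain names are taken from the list of SPIB tests above, but the content areas and sub-content areas are invented for illustration and are not the actual SPIB specification.

```python
from collections import defaultdict

# Hypothetical fragment: domain -> content area -> sub-content areas.
# Only the domain names come from the text; everything beneath them is assumed.
blueprint = {
    "banking": {
        "checking accounts": ["writing a check", "reading a statement"],
        "savings accounts": ["making a deposit"],
    },
    "job search skills": {
        "job applications": ["filling out a form", "listing references"],
    },
}

def item_slots(blueprint, items_per_subarea=1):
    """Yield one (domain, content area, sub-content area) slot per item to write."""
    for domain, areas in blueprint.items():
        for area, subareas in areas.items():
            for sub in subareas:
                for _ in range(items_per_subarea):
                    yield (domain, area, sub)

def coverage(slots):
    """Count how many item slots fall in each domain."""
    counts = defaultdict(int)
    for domain, _, _ in slots:
        counts[domain] += 1
    return dict(counts)

slots = list(item_slots(blueprint))
per_domain = coverage(slots)
```

Writing items against slots rather than against a loose topic list makes it easy to check that every sub-content area is represented before items are field tested.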


Three studies have been conducted to examine the
relationship between SPIB performance and applied
performance. In order to estimate predictive validity,
a sample of vocational rehabilitation clients was rated
by rehabilitation counselors on 29 behaviors in the
broad areas covered by the SPIB. This same sample had
been tested on the SPIB one year earlier, just prior to
graduation from high school.


The concurrent validity of the SPIB was also
estimated with a second sample of vocational rehabilitation
clients. These clients were tested on the SPIB
and rated by their counselors within a three-month
period of time.


The concurrent validity of SPIB-T was estimated
with a sample of residents from community facilities.
The applied performance scale, in this study, consisted
of 87 items that were constructed to parallel directly
the content of the SPIB-T. Testing and rating were
both accomplished within a three-month period.
A variety of correlational analyses were performed
within each of the three validity studies. Some of the
most interesting results were provided by canonical
correlations, using the nine SPIB tests on one side
of the relationship and the major sub-scales of the
criterion instrument on the other side. These analyses
produced canonical correlations of .58 in the SPIB
predictive validity study, [value illegible] in the SPIB
concurrent validity study, and .75 in the SPIB-T concurrent
validity study. These results strongly suggest that
knowledge and applied performance in the SPIB domains
are related to one another.
Discussion of the validity studies. When one
considers the findings of the GATB, TOWER, and SPIB
studies together, the overall appraisal is far from
encouraging. In a tightly controlled study, the
relationship between general aptitudes and vocational
performance did not appear to be very strong.
Since the criterion variable involved performance
ratings of unknown reliability, it is possible that
the low relationship was caused partially by measurement
error. On the other hand, it seems quite likely
that the general magnitude of the reported relationship
is accurate, implying that vocational success requires a
great deal more than skill in a number of very general
vocational aptitudes. But didn't we really know that
all along?
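
How much a low relationship might owe to measurement error can be gauged with the classical correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. This formula is a standard psychometric result and is not a procedure reported in the GATB study; the Python sketch below uses hypothetical values throughout.

```python
import math

def disattenuated_r(r_observed, rel_predictor, rel_criterion):
    """Classical correction for attenuation:
    r_corrected = r_observed / sqrt(rel_predictor * rel_criterion)
    """
    if not (0 < rel_predictor <= 1 and 0 < rel_criterion <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_observed / math.sqrt(rel_predictor * rel_criterion)

# Hypothetical values: an observed correlation of .28, a reliable test (.90),
# and supervisor ratings of only modest reliability (.60).
corrected = disattenuated_r(0.28, 0.90, 0.60)
```

Under these assumed values the corrected coefficient rises only to about .38, which is consistent with the argument that unreliable criterion ratings explain part, but not most, of a weak relationship.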


The TOWER project, when it began, must have been
seen as providing an opportunity for a major breakthrough.
After all, people had been arguing for some time that
work samples should provide a much better indication
of work performance than scores derived from paper and
pencil tests that measure fairly abstract abilities.
But the study did not confirm this highly believed
hypothesis, and the author was quick to point out the
many methodological weaknesses that most certainly
contributed to the findings. And so we are left with
a question still to be answered by future research.
Recent opinion suggests that this question may
actually be a dilemma, for it seems to hinge in part
on the following issue. Work sample evaluation in the
past has deliberately remained flexible in order to
avoid some of the pitfalls of standardized assessment
that tend to lower examinee motivation and raise examinee
anxiety. The price of this flexibility is increased
measurement error, perhaps even to the point of making
work sample evaluation invalid as a formal testing
mechanism.

The SPIB studies are encouraging, showing
moderately strong relationships between measures of
knowledge and applied performance in several social and
prevocational domains. Several methodological factors
seem to have facilitated these findings. First, the
reliability of both the predictor and criterion variables
was ascertained to be satisfactorily high. More
importantly, the items for both the predictor and
criterion measures, although different in format,
were drawn from the same general content domains.
This permitted an examination of the relationship
between knowledge and applied performance in the same
domain, a much more reasonable expectation than knowledge
in one domain predicting applied performance in
an entirely different domain.


Process or Product of Learning

Most psychological measurement in the past has
focused only on the products of past learning, assuming
that the opportunities to learn have been adequate.
Test items are concerned with what an examinee can do,
rather than with what is required to guarantee success.
Several researchers in the field of mental retardation,
however, have been arguing strongly in recent years
that we really ought to be at least equally concerned
with measuring the process of learning (Haywood, et al.,
1975; Gold, 1973; Budoff & Hamilton, 1976; Kulman, et al.,
1975). If our primary question, as Budoff and Hamilton
(1976) suggest, is to ascertain a person's ability to
learn and profit from optimal instructional experience,
then our test formats should include opportunities for
examinees to learn and practice what they are expected
to perform.

This sentiment has also been strongly expressed by
the Vocational Evaluation and Work Adjustment Association
(Kulman, et al., 1975). In discussing the strengths and
weaknesses of work sample evaluation, they state that
testing should always be preceded by some degree of
training. In their words, "if a client does not
perform adequately following standardized industrial
instructions, it is necessary to determine what type(s)
of instruction will facilitate his understanding of
the task. . . . The evaluation of the clients'
ability to learn, their retention, and most efficient
means of acquiring information are integral parts of
the total assessment process" (p. 59).


Although the research on work sample evaluation
does not yet illustrate this principle, Budoff and
Haywood in this country and Feuerstein in Israel have
conducted several series of studies that have come to
be known as "learning potential" research. Their
findings strongly suggest that tests which include
measurement of learning process will be more predictive
of future performance than tests which measure only
the current products of past learning. Although these
studies have focused primarily upon the measurement of
general cognitive abilities, their methods and findings
have implications for the more applied domains of
vocational and prevocational assessment.
A recent example of such an application is found
in a test called the Trainee Performance Sample (Bellamy &
Snyder, 1976). The TPS was designed to predict the
practicality of training severely and profoundly
retarded adults in a fairly narrow range of light
industrial tasks. Notice that the prediction is
concerned with practicality rather than feasibility,
since the criterion behaviors are known to be feasible.
The purpose of testing is to predict the level of
resources that will be required to achieve the
training objectives.


The TPS consists of 30 items that each represent
a learning operation in a vocational context. Testing
involves training, including the opportunity to profit
from correction. Preliminary reliability and validity
data with this instrument are very encouraging.

A more radical approach involving measurement of
learning process calls for the full integration of
training and evaluation. When this occurs, each step
of the training process is monitored, and trainee
performance provides immediate feedback concerning
the effectiveness of training procedures. If the results
are not satisfactory, procedures can be revised until
the trainee acquires the desired behavior (Gold, 1973).
As the description of this approach implies, its utility
is generally restricted to measurement issues and
problems that arise within the context of training.
There is a third, perhaps even more radical,
context in which the argument for measuring learning
process has been promulgated. One example of this
position comes from the literature on career education
for the handicapped. Dunn (1973) argues that most of
our current career education programs provide students
with information, rather than teaching them how to find
and acquire the information for themselves. Instead
of the prevailing approach, Dunn suggests that career
education programs should focus on decision-making,
which has two components: acquiring information, and
developing and implementing a strategy for processing
the information. The startling implication of this
philosophy is that measurement would focus almost
exclusively on the process of learning, drawing upon
particular products only in so far as they illustrated
key components of the process. The work of Goldstein
and his associates in development of the Social
Learning Curriculum provides an excellent example of
this principle.

Norm-Referenced or Criterion-Referenced?

One of the most interesting measurement issues to
emerge in recent years has been the argumentation in
support of criterion-referenced tests as a supplement
to, if not a replacement of, norm-referenced tests.
Not infrequently, there has been more heat than light
shed during presentation of the arguments. The source
of the confusion lies in the ambiguity of the word
"criterion," which has two completely different
meanings. The first meaning, that we have already
discussed, refers to the validity of a test in terms
of the extent to which the test measures directly or
indirectly the criterion behavior of ultimate concern.
This meaning of criterion is descriptive. The second
meaning of the word "criterion" refers to a critical
level or standard of performance. This meaning of
criterion is evaluative.

CRITERION
Meaning 1 = Ultimate Behavior
Meaning 2 = Standard of Performance

The two major motives that lie in back of the
criterion-referenced movement are derived from the
two different meanings of criterion (Donlon, 1975).
The first motive involves the desire to test people
on instructionally relevant dimensions; i.e., the
criterion behaviors themselves. The second motive
involves the desire to evaluate test scores on the
basis of performance magnitude rather than relative
position within a group.
Let us consider first the instructional motive.
Several quotations are illustrative. Hively (1975)
suggests "the most important characteristic of
domain-referenced testing is that they [sic] provide students
with clear opportunities to try over and over again to
achieve well-defined areas of skill. Norm-referenced
testing systems do not provide such opportunities"
(p. 5). Reynolds (1975) states that "in today's
context the measurement technologies ought to become
integral parts of instruction designed to make a
difference in the lives of children and not just a
prediction about their lives" (p. 15).
Both of these quotations allude to the fact that
most norm-referenced tests have been designed for the
purposes of selection or classification, rather than
for the design, modification, or evaluation of
instruction. Although this is clearly an historical
trend, it is not an inherent necessity. Any test may
be constructed in a way that samples criterion behavior
in the descriptive sense of this word. Furthermore,
selection and classification have their proper place
along with the instructional purposes of measurement.


The evaluative meaning of criterion offers some
deeper issues for consideration. Any test score has
both descriptive and evaluative interpretation. The
descriptive aspect of a test is simply its raw score;
i.e., the number of items answered correctly. The
evaluative aspect of a test is the interpretation
placed upon a score indicating level of success.
Both criterion-referenced and norm-referenced tests
produce raw scores, which might even be derived from
the same test items. In one case, evaluation is tied
to the performance of one or more reference groups.
In the other case, evaluation is based upon an arbitrary
judgment (Donlon, 1975).

Neither approach presents the entire picture, or,
stated more positively, each approach presents part of
the picture. The main virtue of performance magnitude,
as an evaluative dimension, is its direct interpretability
as achievement. An examinee who responds correctly to
95% of the items on a test may have performed well.
How adequate is 80% or 85%? Inevitably, the
judgment is somewhat arbitrary if test performance and
implicit standards are the only frames of reference.
     An added perspective may be obtained if an
examinee's test scores are compared to the
performance of an appropriate reference group. Norm
referencing, however, usually produces an index of
relative performance, such as percentiles, which, if
considered alone, provides no information at all about
magnitude. Furthermore, if one reports test scores
only in relative terms, this may actually mask the
measurement of growth. An individual may grow in
magnitude over the years, but never improve his relative
position with reference to a norm group. Growth in this
context goes unnoticed, which leads one to agree with
Donlon (1975) when he states that "from an individual
point of view, a great deal is wrong with a number
that tells you not how much you did, but how many
you outdid" (p. 33).
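The masking of growth by relative scores can be shown with a small numerical sketch. Everything in it is hypothetical and chosen only for illustration: the norm-group scores, the five-point gain, and the helper name percentile_rank are assumptions, not data from any test discussed in this paper. The sketch assumes that the examinee and every member of the norm group gain the same number of raw-score points between two testings.

```python
from bisect import bisect_left


def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring strictly below `score`."""
    ordered = sorted(norm_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Hypothetical raw scores for a ten-person norm group at the first testing.
norm_year1 = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

# Assumption: everyone, examinee included, gains five raw-score points.
growth = 5
norm_year2 = [s + growth for s in norm_year1]

examinee_year1 = 15
examinee_year2 = examinee_year1 + growth

# Magnitude (criterion-style) view: the raw score shows the gain.
raw_gain = examinee_year2 - examinee_year1

# Relative (norm-style) view: position within the group at each testing.
rank_year1 = percentile_rank(examinee_year1, norm_year1)
rank_year2 = percentile_rank(examinee_year2, norm_year2)

print("raw gain:", raw_gain)
print("percentile rank, first testing:", rank_year1)
print("percentile rank, second testing:", rank_year2)
```

Under these assumptions the raw score rises while the percentile rank does not move, which is the situation Donlon's remark describes. If the examinee gained more or less than the norm group, the rank would shift, but it would still say nothing about how much was learned.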
     Given these pitfalls with both approaches to
evaluating test scores, it seems wor[...]
criterion referencing as an att[...]
arbitrarily the ideal level of [...]
norm referencing to serve such a role adequately, test
scores within the reference group will have to be [...]
position.
     [...] generalizability of an examinee's performance (Donlon, 1975;
Rosner, 1975; Hofmeister, 1975; Joselyn, 1975). Rosner


states this succinctly when he points out that a test 


"must assess a generalizable behavior -- a skill that
transfers within a specific domain. . . . Teaching to
the test is a worthwhile enterprise only if what is
learned will be apparent in other situations which are
similar to the test in certain ways but involve different
stimuli and/or contexts" (p. 45). This concern with
generalizability is relevant for both criterion-referenced
and norm-referenced tests.

Enter the Politician



     Most of the issues presented thus far have been
concerned with the principles and practices of measurement.
During the past decade, however, a political
dimension has emerged that is likely to rattle more
cages than all of our professional issues combined.
It seems fitting, therefore, to close this presentation
with a discussion of politics.


     It all began with passage of the 1964 Civil Rights
Act. Title VII of this Act stipulates that employers,
labor unions, and employment agencies may not discriminate
on the basis of race, color, religion, sex, or national
origin. Of particular relevance to the field of measurement
is John Tower's amendment to Title VII, which
reads as follows (Koenig, 1974):




Nor shall it be an unlawful employment practice 
for an employer to give and to act upon the 
results of any professionally developed ability 
test, provided that such test, its administra- 
tion or action upon the results, is not designed, 
intended or used to discriminate because of race, 
color, religion, sex, or national origin.
     In order to implement Title VII, the Equal Employment
Opportunity Commission (EEOC) was created.

     As Koenig points out (1974), it is not entirely
clear whether the purpose of the Tower amendment was to
prevent discrimination or to make it legitimate under
the disguise of scientific objectivity. In any case, the
sale of psychological tests to industry boomed during
the late 60's, to the point where some referred to
Title VII as the industrial psychologists' Guaranteed
Income Act.

     In 1970, a somewhat amazing event occurred in the
publication of Title VII guidelines by the EEOC (Federal
Register, 1970). It was not the mere publication of
these guidelines that was amazing, but rather their
strength. Consider the following examples:

Sec. 1607.3. The use of any test which
adversely affects hiring, promotion, transfer,
or any other employment . . . opportunity
[for minorities] constitutes discrimination
unless (a) the test has been validated and
evidences a high degree of utility . . .




and (b) the person giving or acting upon the
results of the particular test can demonstrate
that alternative suitable hiring, transfer,
or promotion procedures are unavailable for
his use.

Sec. 1607.4a. Where technically feasible, a
test should be validated for each minority
group with which it is used.

Sec. 1607.4c. Evidence of a test's validity
should consist of empirical data demonstrating
that the test is predictive of or significantly
correlated with important elements of work
behavior which comprise or are relevant to the
job or jobs for which candidates are being
evaluated.

Sec. 1607.7. Any person citing evidence from
other validity studies as evidence of test
validity for his own jobs must substantiate
in detail job comparability and must demonstrate
the absence of contextual and sample
differences [between his applicants and the
sample which provided the validity information].

Who would have believed in 1964 that such stringent
guidelines were to be created? Few tests could even
come close to complying.




     For a while, the EEOC guidelines contained more
bark than bite. But on March 8, 1971, the Supreme
Court radically changed this picture in its ruling on
Griggs vs. the Duke Power Company.
     This case involved a class action suit against the
Duke Power Company with respect to employment practices
at its Dan River station in North Carolina (Koenig,
1974). The company divided its labor force into five
divisions: labor, coal handling, operations, maintenance,
and laboratory and test. Labor involved basically
janitorial work and was the division to which all
but one of the company's 14 black employees were
assigned in 1966. Incidentally, the maximum wage
paid to a black employee was $1.65 per hour, whereas
the entry wage for white employees was $1.88 per hour.
After July 2, 1965, the company decided that black
people could be assigned to the coal handling division,
provided they held a high school diploma and earned
passing scores on the Wonderlic Personnel Test, a
test of cognitive ability, and the Bennett Mechanical
Comprehension Test. Griggs et al. and the Supreme Court
concurred that these conditions had little to do with
ability to shovel coal.


     The impact of the Griggs decision was to reinforce
the validity stipulations of the EEOC guidelines. At
this point in time, there is no direct extension of
Title VII to employment opportunities for the handicapped.
Indeed, there are Sections 503 and 504 of the 1973
Vocational Rehabilitation Act which prohibit federal
contractors from discriminating against the handicapped
and which generally prohibit discrimination against the
handicapped by any recipient of federal funds. Thus
far, however, the issues surrounding measurement of
handicapped people as a potential source of discrimination
have not been translated into regulations. In my
opinion, we would all be well advised to deal carefully
with these issues as a profession before the impetus,
and the control, is taken away from us by the
politicians.




Aptitude                      Test

G - Intelligence              Three dimensional space
                              Arithmetic reasoning
                              Vocabulary
V - Verbal Aptitude           Vocabulary
N - Numerical Aptitude        Computation
                              Arithmetic reasoning
S - Spatial Aptitude          Three dimensional space
P - Form Perception           Tool matching
                              Form matching
Q - Clerical Perception       Name comparisons
K - Motor Coordination        Mark making
F - Finger Dexterity          Assemble
                              Disassemble
M - Manual Dexterity          Place
                              Turn

Figure 1
Components of the General Aptitude Test Battery



Table 1.
Relationship Between Test Norms
(G-105, V-105, N-105) and Dichotomized Criterion --
Case Worker 195.108: N = 106

                  Non-Qualifying    Qualifying
                  Test Scores       Test Scores
Good Workers           20               54
Poor Workers           18               14
TOTAL                  38               68

phi = .28
p < .005
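
The coefficient reported with Table 1 can be checked with a short calculation. The sketch below rests on several assumptions that are not stated in the table itself: that the reported statistic is the phi coefficient for a 2 x 2 table, that the qualifying-column counts are 54 good workers and 14 poor workers (the values consistent with the column total of 68), and that significance was judged by a chi-square test with one degree of freedom. The function name phi_coefficient is likewise only illustrative.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table laid out as [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Assumed cell counts (see the lead-in): rows are good and poor workers,
# columns are qualifying and non-qualifying test scores.
good_qualifying, good_nonqualifying = 54, 20
poor_qualifying, poor_nonqualifying = 14, 18

n = (good_qualifying + good_nonqualifying
     + poor_qualifying + poor_nonqualifying)

phi = phi_coefficient(good_qualifying, good_nonqualifying,
                      poor_qualifying, poor_nonqualifying)

# For a 2 x 2 table, chi-square with 1 df equals N times phi squared.
chi_square = n * phi ** 2

# Upper-tail probability of a chi-square variate with 1 df.
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print("N =", n)
print("phi =", round(phi, 2))
print("chi-square =", round(chi_square, 2))
print("p =", round(p_value, 4))
```

With these counts the computed phi rounds to .28 and the probability falls below .005, in agreement with the values printed under the table. A positive sign here simply reflects the chosen layout, in which qualifying scores and good workers share the first row and column.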


References 


Bellamy, G. T., & Snyder, S. The trainee performance sample: Toward the
    prediction of habilitation costs for severely handicapped adults.
    In T. Bellamy (Ed.), Habilitation of severely and profoundly retarded
    adults. Eugene, Oregon: Specialized Training Program, University of
    Oregon, 1976, 79-90.
Budoff, M., & Hamilton, J. Optimizing test performance of moderately and
    severely mentally retarded adolescents and adults. American Journal
    of Mental Deficiency, 1976, 81(1), 49-57.
Donlon, T. Referencing test scores: Introductory concepts. In W. Hively
    & M. Reynolds (Eds.), Domain-referenced testing in special education.
    Reston, Va.: Council on Exceptional Children, 1975, 29-42.
Federal Register. Guidelines on employee selection procedures. 35FR12333,
    August 1, 1970.
Gold, M. Research on the vocational habilitation of the retarded: The
    present, the future. In N. Ellis (Ed.), International review of research
    in mental retardation: Volume 6. New York: Academic Press, 1973, 97-148.
Halpern, A., Raffeld, P., Irvin, L., & Link, R. Measuring social and prevocational
    awareness in mildly retarded adolescents. American Journal of
    Mental Deficiency, 1975, 80, 81-89. (a)


Halpern, A., Raffeld, P., Irvin, L., & Link, R. Social and prevocational
    information battery. Monterey, Ca.: CTB/McGraw-Hill, 1975. (b)
Haywood, H. C., et al. Behavioral assessment in mental retardation. In
    P. McReynolds (Ed.), Advances in psychological assessment: Volume 3.
    Washington: Jossey-Bass, 1975, 96-136.


Hively, W. Introduction. In W. Hively & M. Reynolds (Eds.), Domain-referenced
    testing in special education. Reston, Va.: Council on Exceptional
    Children, 1975, 1-14.
Hofmeister, A. Integrating criterion-referenced testing and instruction. In
    W. Hively & M. Reynolds (Eds.), Domain-referenced testing in special
    education. Reston, Va.: Council on Exceptional Children, 1975, 77-88.




Irvin, L., & Halpern, A. Reliability and validity of the Social and Prevocational
    Information Battery for mildly retarded individuals. American
    Journal of Mental Deficiency, In press.
Irvin, L., Halpern, A., & Reynolds, W. Social and Prevocational Information
    Battery - Form T. Eugene, Oregon: Rehabilitation Research and Training
    Center, University of Oregon, 1977.
Irvin, L., Halpern, A., & Reynolds, W. Measuring social and prevocational
    awareness in moderately retarded individuals. American Journal of
    Mental Deficiency, In press.
Joselyn, G. Ethical considerations in the use of standardized tests. In
    W. Hively & M. Reynolds (Eds.), Domain-referenced testing in special
    education. Reston, Va.: Council on Exceptional Children, 1975, 121-140.




Koenig, P. They just changed the rules on how to get ahead. Psychology
    Today, June, 1974, 87-95, 100-103.
Kulman, et al. The tools of vocational evaluation. Vocational Evaluation and
    Work Adjustment Bulletin, 1975, 8, 49-64.
Lavisky, S. Invited address. In J. Sanders & T. Sachse (Eds.), Problems and
    potentials of applied performance testing. Portland, Oregon: Northwest
    Regional Educational Laboratory, 1975, 33-54.


McClelland, D. Testing for competence rather than for 'intelligence.'
    American Psychologist, 1973, 28, 1-14.
Manual for the USES General Aptitude Test Battery. Section III: Development.
    United States Department of Labor, Manpower Administration, 1970.
Reynolds, M. Trends in special education: Implications for measurement.
    In W. Hively & M. Reynolds (Eds.), Domain-referenced testing in special
    education. Reston, Va.: Council on Exceptional Children, 1975, 15-28.
Rosenberg, B. The job sample in vocational evaluation. Final Report of
    Project RD-561. New York: Institute for the Crippled and Disabled,
    1967.
Rosner, J. Testing for teaching in an adaptive educational environment.
    In W. Hively & M. Reynolds (Eds.), Domain-referenced testing in special
    education. Reston, Va.: Council on Exceptional Children, 1975, [pages illegible].
Testing orientation and work evaluation in rehabilitation. ICD Rehabilitation
    and Research Center. New York: 1974.




